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Abstract —Nonnegative matrix factorization (NMF) is a pow¬ 
erful class of feature extraction techniques that has been suc¬ 
cessfully applied in many fields, namely in signal and image 
processing. Current NMF techniques have been limited to a 
single-objective problem in either its linear or nonlinear kernel- 
based formulation. In this paper, we propose to revisit the 
NMF as a multi-objective problem, in particular a bi-objective 
one, where the objective functions defined in both input and 
feature spaces are taken into account. By taking the advantage of 
the sum-weighted method from the literature of multi-objective 
optimization, the proposed bi-objective NMF determines a set 
of nondominated, Pareto optimal, solutions instead of a single 
optimal decomposition. Moreover, the corresponding Pareto front 
is studied and approximated. Experimental results on unmixing 
real hyperspectral images confirm the efficiency of the proposed 
bi-objective NMF compared with the state-of-the-art methods. 

Index Terms —Kernel machines, nonnegative matrix factoriza¬ 
tion, Pareto optimal, hyperspectral image, unmixing problem. 

I. Introduction 


N onnegative matrix factorization (nmf) 

provides a parts-based representation for the nonnegative 
data entries, and has becoming a versatile technique with 
plenty of applications m. As opposed to other dimension¬ 
ality reduction approaches, e.g., principal component analysis, 
vector quantization and linear discriminant analysis, the NMF 
is based on the additivity of the contributions of the bases 
to approximate the original data. Such decomposition model 
often yields a tractable physical interpretation thanks to the 
sparse and nonnegative obtained representation of the input 
data. Many real world applications benefit from these virtues, 
including hyperspectral unmixing 0, 0, face and facial 
expression recognition 0. 0, gene expression data 0, blind 
source separation 0, and spectral clustering QU, 0, to name 
a few. 

The NMF approximates a high-rank nonnegative input ma¬ 
trix by two nonnegative low-rank ones. As a consequence, it 
provides a decomposition suitable for many signal processing 
and data analysis problems, and in particular the hyperspectral 
unmixing problem. Indeed, a hyperspectral image is a cube 
that consists of a set of images of the scene under scrutiny, 
each corresponding to a ground scene from which the light of 
certain wavelength is reflected. Namely, a reflectance spectral 
over a wavelength range is available for each pixel. It is 
assumed that each spectral is a mixture of a few “pure” 
materials, called endmembers. The hyperspectral unmixing 
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problem consists of extracting the endmembers (recorded 
in the first low-rank matrix), and estimating the abundance 
of each endmember at every pixel (recorded in the second 
one). Obviously, the above physical interpretation requires the 
nonnegativity on both abundances and endmember spectrums. 

The NMF is a linear model, since it can be viewed in 
a way that each input spectral is approximated by a linear 
combination of some basis spectrums. To estimate the decom¬ 
position, the objective function for minimization is defined in 
an Euclidean space — the so-called input space X —, where 
the difference between the input matrix and the product of the 
estimated ones is usually measured either by the Frobenius 
norm or by generalized Kullback-Leibler divergence ca. 
These objective functions are often augmented by including 
different regularization terms, such as the Fisher constraint 
for learning local features ED, the sparseness constraint for 
intuitive and easily interpretable decompositions fl2l , the 
temporal smoothness and spatial decorrelation regularization 
ED, and the minimum dispersion regularization for unmixing 
accuracy ED. Other objective functions are also raised from 
practical standpoints, e.g., the £i-norm for the robustness 
against outliers and missing data m and the Bregman di¬ 
vergence with fast computational performance ED. 

Many studies have shown the limits of a linear decompo¬ 
sition, as opposed to a nonlinear one ED, ED, m. While 
most research activities have been concentrated on the linear 
NMF, a few works have considered the nonlinear case. In 
an attempt to extent the linear NMF models to the nonlinear 
scope, several kernel-based NMF have been proposed within 
the framework offered by the kernel machines l20l . Employing 
a nonlinear function, the kernel-based methods mainly map the 
data into a higher dimensional space, where the existing linear 
techniques are performed on the transformed data. The kernel 
trick enables the estimation of the inner product between any 
pair of mapped data in a reproducing kernel Hilbert space 
— the so-called feature space H —, without the need of 
knowing explicitly neither the nonlinear map function nor 
the resulting space. For example, in l20l . the linear NMF 
technique is performed on the kernel matrix, whose entries 
consist of the inner products between input data calculated 
with some kernel function. Other kernel-based NMF tech¬ 
niques presented in |2Tj. 1221 . (23j follow a similar scheme 
but share an additive assumption originated from the convex 
NMF approach proposed in |2ll . that is, the basis matrix is 
represented as the convex combination of the mapped input 
data in the feature space TL. It is worth noting that the objective 
function is the Frobenius norm of the residual between the 
kernel matrix and its factorization, for all the above-mentioned 
kernel-based NMF methods. However, although the input data 
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matrix is nonnegative, the nonnegativity of the mapped data 
is not guaranteed. A more severe disadvantage is that the 
obtained bases lie in the feature space (often of infinite 
dimension), where a reverse mapping to the input space is 
difficult. Indeed, one needs to solve the pre-image problem, an 
obstacle inherited from the kernel machines (24). In (3), these 
difficulties are circumvented by defining a model in the feature 
space that can be optimized directly in the input space. In this 
paper, we revisit this framework to discover the nonlinearity 
of the input matrix. See Section Hl-B I for more details. 

In either its linear conventional formulation or its nonlinear 
kernel-based formulation, as well as all of their variations (and 
regularizations), the NMF has been tackling a single-objective 
optimization problem. In essence, the underlying assumption 
is that it is known in prior that the linear model dominates the 
nonlinear one, or vice versa, for the input data under study. 
To obtain such prior information about the given input data 
is not practical in most real-world applications. Moreover, it 
is possible that a fusion of the linear and nonlinear models 
reveals the latent variables closer to the ground truth than each 
single model considered alone. Independently from the NMF 
framework, such combination of the linear model with a non¬ 
linear fluctuation was recently studied by Chen, Richard and 
Honeine in [ 181 and (25) where, in the former, the nonlinearity 
depends only on the spectral content, while it is defined by a 
post-nonlinear model in the latter. A multiple-kernel learning 
approach was studied in (26) and a Bayesian approach was 
investigated in ED with the so-called residual component 
analysis. All these methods share one major drawback: they 
only consist in estimating the abundances, with a nonlinear 
model, while the endmembers need to be extracted in a pre¬ 
processing stage using any conventional linear technique (N- 
Findr, vertex component analysis, ... ED). As opposed to such 
separation in the optimization problems, the NMF provides an 
elegant framework for solving jointly the unmixing problem, 
namely estimating the endmembers and the abundances. To 
the best of our knowledge, there have been no previous studies 
that combine the linear and nonlinear models within the NMF 
framework. 

In this paper, we study the bi-objective optimization prob¬ 
lem that performs simultaneously the NMF in both input 
and feature spaces, by combining the linear and kernel-based 
models. The first objective function to optimize stems from the 
conventional linear NMF, while the second objective function, 
defined in the feature space, is derived from a kernel-based 
NMF model. In case of two conflicting objective functions, 
there exists rather a set of nondominated, noninferior or Pareto 
optimal solutions, as opposed to the unique decomposition 
when dealing exclusively with one objective function. In order 
to acquire the Pareto optimal solutions, we investigate the 
sum-weighed method from the literature of multi-objective 
optimization, due to its ease for being integrated to the 
proposed framework. Moreover, propose to approximate the 
corresponding Pareto front. The multiplicative update rules are 
derived for the resulting sub-optimization problem in the case 
where the feature space is induced by the Gaussian kernel. 
The convergence of the algorithm is discussed, as well as 
initialization and stopping criteria. 


The remainder of this paper is organized as follows. We 
first revisit the conventional and kernel-based NMF. The dif¬ 
ferences between the input and the feature space optimization 
are discussed in Section [III] In Section [IV] we present the 
proposed bi-objective NMF framework. Section [V] demon¬ 
strates the efficiency of the proposed method for unmixing 
two real hyperspectral images. Conclusions and future works 
are reported in Section [VT] 

II. A PRIMER ON THE LINEAR AND NONLINEAR NMF 

In this section, we present the two NMF variants, with the 
linear and the nonlinear models, as well as the corresponding 
optimization problems. 

A. Conventional NMF 

Given a nonnegative data matrix X € 3ff ixT , the conven¬ 
tional NMF aims to approximate it by the product of two 
low-rank nonnegative matrices E G 5R LxW and A G 5R WxT , 
namely 

X « EA , (1) 

under the constraints E > 0 and A > 0, where the non¬ 
negativity is element-wise. An equivalent vector-wise model 
is given by considering separately each column of the matrix 
X, namely x t for t = 1,..., T, with 

N 

*Et ~ ^ ^ O n t 
n=1 

where each x t is represented as a linear combination of the 
columns of E, denoted e n for n = 1,... ,N, with the scalars 
a n t for n = 1 ,... ,N and t = 1 ,... ,T being the entries of 
the matrix A. The subspace spanned by the vectors x t , as well 
as the vectors e n is denoted the input space X. 

To estimate both matrices E and A, one concentrates on 
the minimization of the Frobenius squared error norm i ||X — 
EA\\p, subject to E > 0 and A > 0. In its vector-wise 
formulation, the objective function to minimize is 

1 T N 

J X (E,A) = -^\\x t -^a nt e n \\ 2 , (2) 

1 n —1 

where the residual error is measured between each input vector 
x t and its approximation X^=i a nt^n in the input space 
X. The optimization is operated with a two-block coordinate 
descent scheme, by alternating between the elements of E or 
of A. while keeping the elements in the other matrix fixed. 

B. Nonlinear - kernel-based - NMF 

A straightforward generalization to the nonlinear form is 
proposed within the framework offered by the kernel ma¬ 
chines. In the following, we present the kernel-based NMF that 
we have recently proposed in 0. It is worth noting that other 
variants can also be investigated, including the ones studied 
in l2Qj . (21) . (22), (23) . However, these variants suffer from 
the pre-image problem, making the derivations and the study 
more difficult; see (29) for more details. 
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Consider a nonlinear function $(•) that maps the columns of 
the matrix X, as well as the columns of the matrix E, from 
the input space X to some feature space 'H. Its associated 
norm is denoted || • \\u, and the corresponding inner product 
in the feature space is of the form $(a; t /))^, which 

can be evaluated using the so-called kernel function K(x tl x t ') 
in kernel machines. Examples of kernel functions are the 
Gaussian and the polynomial kernels. 

Applying the model Q in the feature space, we get the 
following matrix factorization model 

[$(ati) $(a3 2 ) ••• 3 >(£Ct)] « [$(ei) <k(e 2 ) $(e N )}A, 


or equivalently in the vector-wise form, for all t = 1,..., T, 

N 

$(®t) « a nt $(e„). 

n =1 

Under the nonnegativity of all e ; v and entries of A, the 
optimization problem consists in minimizing the sum of the 
residual errors in the feature space PL, between each A>(xi) 
and its approximation ^ 2^=1 a nt $(e n ), namely 


N 


(3) 


ME, A) = - p(*t) - am $(e„) 

t—1 n =1 

By analogy to the linear case, a two-block coordinate descent 
scheme can be investigated to solve this optimization problem. 


III. Input versus feature space optimization 
The difference between the linear and the nonlinear cases 
is illustrated in Fig. [H With the linear NMF, each sample x t 
is approximated with a linear combination of the N elements 
e n , namely by minimizing the Euclidean distance in the input 
space between each x t and x t = a nt e„. With the 

nonlinear case, using the kernel-based formalism, the opti¬ 
mization is considered in the feature space, by minimizing the 
distance in PL between <h(:E t ) and = J2n=i a nt The 

two models, and the corresponding optimization problems, are 
distinct (except for the trivial linear kernel). 


A. The pre-image problem 

An attempt to bridge this gap is to provide a representation 
of in the input space, namely estimating the element of 
X whose image with the mapping function «1> (•) is as close 
as possible to ’I'j. This is the pre-image problem, which is an 
ill-posed nonlinear optimization problem; see |24j for more 
details. As shown in the literature investigating the pre-image 
problem, and demonstrated recently in lf30l Theorem 1], the 
pre-image takes the form J2n=i a nt for some unknown 
coefficients a! nt . These coefficients depend on the e„, making 
the model implicitly nonlinear. 

It is worth noting that this difference, between the linear and 
the nonlinear case, is inherited from the framework of kernel 
machines; see ED. This drawback spans also the multiple 
kernel learning models, of the form Yln =1 Unt K ( e «>') where 
the kernel k is a (convex) combination of several kernels ||32| . 
While we focus in this paper on the NMF, our work extends 
to the wide class of kernel methods, by providing a framework 
to optimize in both input and feature spaces, as shown next. 


$(•) 



Fig. 1: In the linear NMF, each sample x t is approximated by 
x t in the input space X, while in the kernel-based NMF, the 
mapped sample ( l>(x f ) is approximated by \& t in the feature 
space PL. The proposed bi-objective NMF solves simultane¬ 
ously the two optimization problems. 


B. On augmenting the linear model with a nonlinearity 

Recently, several works have been investigating the com¬ 
bination of a linear model, often advocated by a physical 
model, with an additive nonlinear fluctuation, determined with 
a kernel-based term. The model takes the form 

N 

Xt — ^ ^ ant ftn T 

n —1 

where belongs to some nonlinear feature space. Several 
models have been proposed to define this nonlinearity, as 
outlined here. In en, the nonlinearity depends exclusively 
on the endmembers e n . In ||26l , the above additive fluctuation 
is relaxed by considering a convex combination with the so- 
called multiple kernel learning. More recently, the abundances 
are incorporated in the nonlinear model, with a post-nonlinear 
model as studied in |25j and a Bayesian approach is used in 
lf27l with the so-called residual component analysis. Another 
model is proposed in ED in the context of supervised learning. 

All these approaches consider that the endmembers e n are 
already known, or estimated using some linear techniques such 
as the vertex component analysis (VCA) ll28l . The nonlinearity 
is only investigated within the abundances a n t■ As opposed 
to these approaches, the method considered in this paper 
investigates also the estimation of the endmembers e n , with a 
nonlinear relation with respect to it. 

IV. Bi-objective optimization in both input and 

FEATURE SPACES 

In this section, we propose to solve simultaneously the two 
optimization problems, in the input and the feature spaces. See 
Fig. m for an illustration. 


A. Problem formulation 

Optimizing simultaneously the objective functions 
Jx{E,A) and Ju(E,A), namely in both the input 
and the feature space, is in a sense an ill-defined problem. 
Indeed, it is not possible in general to find a common solution 
that is optimal for both objective functions. As opposed to the 
single-objective optimization problems where the main focus 
would be on the decision solution space, namely the space of 
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all entries (E, A) (of dimension LN + NT), the bi-objective 
optimization problem brings the focus on the objective 
space, namely the space to which the objective vector 
[Jx(E,A) J-y(E,A)\ belongs. Beyond such bi-objective 
optimization problem, multi-objective optimization has been 
widely studied in the literature. Before taking advantage of 
this literature study and solving our bi-objective optimization 
problem, we revisit the following definitions in our context: 

« Pareto dominance: The solution (E\,Ax) is said to 
dominate (E 2 ,A 2 ) if and only if Jx(E\ ,M) < 

Jx(E 2 ,A 2 ) and < Ju{E 2 , A 2 ), where at 

least one inequality is strict. 

• Pareto optimal: A given solution (E*, A*) is a global 
(respectively local) Pareto optimal if and only if it is not 
dominated by any other solution in the decision space 
(respectively in its neighborhood). That is, the objective 
vector [Jx(E* , A*) Ju{E* , A*)] corresponding to a 
Pareto optimal ( E *, A*) cannot be improved in any space 
(input or feature space) without any degradation in the 
other space. 

• Pareto front: The set of the objective vectors correspond¬ 
ing to the Pareto optimal solutions forms the Pareto front 
in the objective space. 

Various multi-objective optimization techniques have been 
proposed and successfully applied into engineering fields, e.g., 
the evolutionary algorithms l33l . sum-weighted algorithms 
01, £35), e-constraint method ESI, ED, normal boundary 
intersection method 138]. to name a few. See the survey (39], 
M and the references therein on the methods for multi¬ 
objective optimization. Among the existing methods, the sum- 
weighted or scalarization method has been always the most 
popular one, since it is straightforward and easily to imple¬ 
ment. A sum-weighted technique converts a multi-objective 
problem into a single-objective problem by combining the 
multiple objectives. Under some conditions, the objective 
vector corresponding to latter’s optimal solution belongs to the 
convex part of multi-objective problem’s Pareto front. Thus, 
by changing the weights among the objectives appropriately, 
the Pareto front of the original problem is approximated. The 
drawbacks of the sum-weighted method reside in that the 
nonconvex part of the Pareto front is unattainable, and even 
on the convex part of the front, a uniform spread of weights 
does not frequently result in a uniform spread of Pareto points 
on the Pareto front, as pointed out in (34l . Nevertheless, the 
sum-weighted method is the most practical one, in view of 
the complexity of the NMF problem, which is nonconvex, ill- 
posed and NP-hard. 

B. Bi-objective optimization with the sum-weighted method 

Following the formulation introduced in the previous 
section, we propose to minimize the bi-objective function 
[Jx{E,A) J-y{E,Aj\, under the nonnegativity of the 
matrices E and A. The decision solution, of size LN + NT, 
corresponds to the entries in the unknown matrices E and A. 
In the following, we use the sum-weighted method, which 
is the most widely-used approach to tackle multi-objective 


optimization. To this end, we transform the bi-objective prob¬ 
lem into an aggregated objective function which is a convex 
combination of the two original objective functions. Let 

J{E, A) = aJ x {E , A) + (1 - a)J n (E, A) 

be the aggregated objective function (i.e., sum-weighted objec¬ 
tive function, also called scalarization value) for some weight 
a G [0,1] that represents the relative importance between 
objectives Jx and Jy ■ The optimization problem becomes 

min aJx{E, A) + (1 — a) J-y[E, A) 

EA (4) 

subject to E > 0 and A > 0 

For a fixed value of the weight a, the above problem is 
called the suboptimization problem. The solution to the above 
suboptimization problem is a Pareto optimal for the original 
bi-objective problem, as proven in |34] for the general case. 
By solving the above suboptimization problem with a spread 
of values of the weight a, we obtain an approximation of the 
Pareto front. It is obvious that the model breaks down to the 
single-objective conventional NMF in (0 with a = 1, while 
the extreme case with a = 0 leads to the kernel NMF in 0. 

The optimization problem © has no closed-form solu¬ 
tion, a drawback inherited from optimization problems with 
nonnegativity constraints. Moreover, the objective function is 
nonconvex and nonlinear, making the optimization problem 
difficult to solve. In the following, we propose iterative 
techniques for this purpose. It is noteworthy to mention this 
yields an approximate optimal solution, whose objective vector 
approximates a point on the Pareto front. Substituting the 
expressions given in © and © for Jx and J-y, the aggregated 
objective function becomes 

T N 1 _ T 1 V 

j = — 22 3:4 - 22 ant en 4 — t ,— 22 pk**)~~ '22 ant ^( e «) 

(5) 

This objective function takes the form 

T N N N 

^min a 22 22 a ™t e I x t + 2 22 22 a nt a mteZ e T7i) 

t=l n= 1 ra=l m= 1 

T N ^ N N 

+ (1 — a) 22 (— 22 CLntftfen , Xt ) — 22 22 a ™ 4Q ™ 4K ( e ™> e ™))> 

t= 1 71=1 71 = 1 771=1 

(6) 

after expanding the expressions of the distances in ©. and 
removing the constant terms xjx t and n(x t ,x t ), since they 
are independent of a n t and e n . 

Although the NMF is nonconvex, its subproblem with one 
matrix fixed is convex. Similar to most NMF algorithms, 
we apply the two-block coordinate descent scheme, namely, 
alternating over the elements in E or in A, while keeping the 
elements in the other matrix fixed. The derivative of © with 
respect to a n t is 

N 

^’ant ^ ^ B n Xf “t“ 2 ' tluit G n G n 2j 

m=l (7) 

N v ’ 

A (1 trj ^ KiyG n , Xt) T 2 ^ CLmt G m , 

m =1 
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and the gradient of ([6]) with respect to e n is 

T N 

Ve n J — OL ^ ^ Cl n t ^ Xf + ^ ^ ^mt^rn^j 

t —1 m=1 

T N 

+ (i-«)E &nt(^ ^e n fci&m 't't') T ^ ^ flmfV2 n ^(^TD^m)j- 

t= 1 m=l 

( 8 ) 

Here, V en K,(e n , ■') represents the gradient of the kernel with 
respect to its argument e„, and can be determined for most 
valid kernels, as shown in TABLE Q] 

Based on the gradient descent scheme, a simple additive 
update rule can be written as 

&nt Grit Vnt ^ ’a nt ^ (9) 

for a n t, and 

= Vn^e. n J (10) 

for e n . The stepsize parameters r] nt and rj n balance the rate 
of convergence with the accuracy of optimization, and can be 
set differently depending on n and t. After each iteration, the 
rectification a n t = max(a„t, 0) should follow to guarantee the 
nonnegativity of all a nt and the entries in all e n . 

C. Multiplicative update rules for the Gaussian kernel 

The additive update rule is easy to implement but the 
convergence can be slow and very sensitive to the stepsize 
value, as pointed out by Lee and Seung in nni. Following the 
spirit of the latter paper, we provide multiplicative update rules 
for the proposed bi-objective NMF. Without loss of generality, 
we restrict the presentation to the case of the Gaussian kernel 
for the second objective function J-m ■ For most valid kernels, 
the corresponding multiplicative update rules can be derived 
using a similar procedure. 

The Gaussian kernel is defined by n(zi,Zj) = 
ex P(5i?HI z i ~ z j\\ 2 ) f° r an y z ii z :i £ X, where a denotes the 
tunable bandwidth parameter. In this case, its gradient with 
respect to e n is 

Ve„K(e„, 2 ) = — —K(e„,z)(e n - z). (11) 

To derive the multiplicative update rule for a n t , we choose the 
stepsize parameter in the additive update rule © as 

Q"nt 

V n t ~ ~ ' 

a E + (!-“) E Urnt n 5 €"m) 

m=1 m=1 

which yields 

ae^x t +(1 - a) K(e ni x t ) 
a nt — Unt X N N 

a E amte n e m + (1 — a) E Umt m ^-m) 

m=1 m=1 

( 12 ) 

Incorporating the above expression V e _ n Ac(e„. z) of the Gaus¬ 
sian kernel in ©, the gradient of the objective function with 
respect to e n becomes ( fl3l ) (given on top of next page). The 
multiplicative update rule for e n is elaborated using the so- 
called split gradient method ED- This trick decomposes the 


TABLE I: Some common kernels and their gradients with 
respect to e n 


Kernel 

K(e n ,z) 

V e „K(e n , z) 

Gaussian 

Polynomial 

Exponential 

Sigmoid 

ex P(2^ll e n - Z || 2 ) 
(z T e n + c) d 

ex P(^rl|en - z\\) 
tanh(72 T e n + c) 

--^ K -{ e n,z)(e n - z) 
d(z T e n +c)G—l) z 
- K.(e n , z )sgn(e„ - z ) 

7sech 2 (7z T e rl + c)z 


expression of the gradient © into the subtraction of two 
nonnegative terms, i.e., V e „ J = P — Q, where P and Q 
have nonnegative entries. To this end, we set the stepsize ?/ n 
corresponding to e n in © as <m (given on top of next 
page), and obtain the following multiplicative update for e n 
in (fl5l) (given on top of next page), where the division and 
multiplication are element-wise. 

On the convergence, initialization and stopping criteria 

The proposed algorithm tolerates the use of any strictly pos¬ 
itive matrices as initial matrices. A simple uniform distribution 
on the positive quadrant is shown to be a good initialization 
in our experiments. It is an advantage over some NMF 
algorithms where stricter initial conditions are required. For 
instance, proposed for the hyperspectral unmixing problem, 
both constrained NMF [|2j and minimum volume constrained 
NMF fl42l initialize the columns of the endmember matrix E 
with randomly chosen pixels from the image under study. 

For each given weight a, the stopping criterion is two-fold, 
either a stationary point is attained, or the preset maximum 
number of iterations is reached. Therefore, the algorithm stops 
at the r;-th iteration if 

< min{ 
or 

Tl = 71 max , 

where denotes the evaluation of the aggregation objective 
function J at the n-th iteration and n max is a predefined 
threshold. 

On the convergence of the proposed algorithm, it is notewor¬ 
thy to mention that the quotients in the multiplicative update 
rules (IT2T ) and (031 ) are unity if and only if 

V Qllt J = 0 and V e „ J = 0, 

respectively. Therefore, the above multiplicative update rules 
imply a part of the Karush-Kuhn-Tucker (KKT) conditions. 
However, the KKT conditions state only the necessary condi¬ 
tions for a local minimum. Concerning the nonconvex problem 
as the studied one, or any problem with non-unique KKT- 
points (stationary points), a local minimum is not guaranteed. 
Similar to other multiplicative-type update rules proposed in 
NMF, the proposed algorithm lacks guaranteed optimality 
property, since the convergence to a stationary point does not 
always correspond to a local minimum. See also the discus¬ 
sions around the convergence of the conventional NMF in |[43l . 
E3. Independently from these theoretical lack of convergence, 
we show next that the proposed algorithm provides relevant 
results, and also outperforms all state-of-the-art methods. 
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N 


^e n J — & ^ ^ tint ^ ^ ^ H~ 9 ^ ^ &nt ^(^nj ®t) (^n ®t) ^ ^ ^m) (* 

£=1 m=l i=l m=l 

&n 

Vn = ■ 


■n ^m 


a G‘ 


€-n — 1 


JL 1\ ± / IV \ 

Q-nt ^mt^-m ~\~ (1 ^nt ( H - n 5 ^m)^m j 

£=1 m—1 i=l ' m=l ' 

T T , N x 

Q!(T 2 J] a n tXt + (1 - a) s a„t («(e„, a; t )a: t + X] a, mt n{e m ^m)^n J 

_t=l_t=l V _m=l_'_ 

T N T , N v 

X/ a «t X^ ^mt^-m -|- (1 Of) tt n ^ f Av(c n , X^)e n -|- GuitHi^S 

nt ^m)^m j 

t=l m=l t=l ' m=l ' 


(13) 

(14) 


(15) 


a a- 


V. Experiments 

In this section, the performance of the proposed algorithm 
for bi-objective NMF is demonstrated on the unmixing of two 
well-known hyperspectral images. An approximation of the 
Pareto front is proposed and a comparison with state-of-the- 
art unmixing methods is conducted. 

A. Dataset and settings 

The first image, depicted in Fig. [2] is the Urban imageQ. 
acquired by the Hyperspectral Digital Imagery Collection 
Experiment (HYDICE) sensor. The top left part with 50 x 50 
pixels is taken from the original 307 x 307 pixels’ image. The 
raw data consists of 210 channels covering the bandwidth from 
0.4/zto to 2.5 pm. As recommended in B51 - only L = 162 
clean bands of high-SNR are of interest. According to the 
ground truth provided in B31 . l46l . the studied area is mainly 
composed by grass, tree, building, and road. 

The second image is a sub-image with 50 x 50 pixels 
selected from the well-known Cuprite image, which was ac¬ 
quired by the Airborne Visible/Infrared Imaging Spectrometer 
(AVIRIS) in 1997. The data is collected over 244 contiguous 
spectral bands, with the wavelength ranging from OApm to 
2.5 pm. After the removal of the noisy bands, L = 189 spectral 
bands remain. As investigated in (47], (48], this area is known 
to be dominated by three materials: muscovite, alunite and 
cuprite. 

Experiments are conducted employing the weight set a £ 
{0, 0.02,0.98,1}, which implies the model varying grad¬ 
ually from the nonlinear Gaussian NMF (a = 0) to the 
conventional linear NMF (a = 1). For each a from the weight 
set, multiplicative update rules given in (ITU and 03 are 
applied, with the maximum iteration number n max = 300. 
The initial matrices of E and A are generated using a [0, 
1] uniform distribution. To choose an appropriate bandwidth 
a in the Gaussian kernel, we first apply the single objective 
Gaussian NMF on both images, using the same candidate set 
{0.2,0.3,..., 9.9,10,15, 20,..., 50} for a. Considering the 
reconstruction error in both input and feature space (see below 
for definitions), we fix er = 3.0 for the Urban image, and 
a = 2.5 for the Cuprite image as investigated in [29]. 



Fig. 2: The ground truth of the Urban image 

B. Approximation of the Pareto front 

Since to determine the whole Pareto front is unrealistic for 
a nonlinear multi-objective optimization problem, one target is 
to approximate the Pareto front by a set of discrete points on 
it (39]. The concept of the Pareto optimal and the Pareto front 
are not strict in the proposed algorithm, due to the solver of the 
suboptimization problem not guaranteeing a local minimum, 
not to mention the global minimum. These obstacles are 
inherited from the nonconvexity and the nonlinearity of the 
kernel-based NMF problem. In this case, the Pareto optimal 
and the Pareto front refer actually to candidate Pareto optimal 
and an approximation of Pareto front, respectively [39]. 

To approximate the Pareto front with a discrete set of points, 
we operate as follows: For each value of the weight a, we 
obtain a solution (endmember and abundance matrices) from 
the algorithm proposed in Section IIV-CI by evaluating the 
objective functions Jx and Ju at this solution, we get a 
single point in the objective space. The approximated Pareto 
front for the Urban and the Cuprite images are shown in 
Fig. |3(a)| and Fig. |3(b)| The evolution of objectives Jx, Ju and 
the aggregated objective function J, evaluated at the solution 
obtained for each weight a, are shown in Fig. |4(a)| for the 
Urban image and in Fig. |4(b)| for the Cuprite image. 

We observe the following: 

1) For both images under study, solutions generated with 
a = 1 and a = 0 are dominated, since all the solutions 
on the Pareto front outperform them, with respect to both 


*The Urban image is available: http://www.erdc.usace.army.mil/Media/—FactSheets/B£gtSdltW6£ticld\li£Wft¥l£ial92tiliaAilfeI£fij®4hB'tt5<jlBiVellilljijial linear 
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Fig. 3: Illustration of the approximated Pareto front in the 
objective space for Urban and Cuprite images. The objective 
vectors of the non-dominated solutions (42 for the Urban 
image, 28 for the Cuprite image), marked in red, approximate a 
part of the Pareto front; the objective vectors of the dominated 
solutions (9 for the Urban image, 23 for the Cuprite image) 
are marked in blue. 


NMF nor the nonlinear Gaussian NMF best fits the studied 
images. On the contrary, the Pareto optimal solutions, 
which result the points on the Pareto front, provide a set of 
feasible and nondominated decompositions for the decision 
maker (DM), i.e., the user. It is worth noting that we apply 
the sum-weighted method as a posteriori method, where 
different Pareto optimal solutions are generated, and the 
DM makes the final comprise among optimal solutions. 
Alternatively, in a priori method, the DM specifies the 
weight a in advance to generate a solution. See f40l for 
more details. 


2) Regarding the sum-weighted approach, the minimizer of 
the suboptimization problem is proven to be a Pareto 
optimal for the original multi-objective problem, i.e., the 
corresponding objective vector belongs to the Pareto front 
in the objective space lf34l . In practice, we obtain 9 and 
23 (out of 51) dominated solutions for the Urban and the 
Cuprite images, respectively. Such phenomenon, however, 
is not surprising, since there exist multiple Pareto optimal 
solutions in a problem only if the objectives are conflict¬ 
ing to each other, as claimed in l49pl Other possible 
explanation could be the applied numerical optimization 
scheme, due to the weak convergence of the method or 
to the failure of the solver in finding a global minimum 
(40). For the Urban image as shown in Fig. |3(a)| and 
Fig. |4(a)| all the obtained solutions are Pareto optimal 
within the objectives-conflicting interval a G [0.18,0.94]. 
Regarding the Cuprite image, as observed in Fig. |3(b)| 
and Fig. |4(b)| the objectives-conflicting interval is a £ 
[0.14,0.72], while the Pareto optimal solutions are found 
using a G {0.04} U (0.26,0.28,..., 0.72}. In fact, the 
obtained solutions with a G {0.14,0.16, ...,0.24} are only 
local Pareto optimal, and they are dominated by a global 
Pareto optimal with a = 0.04. It is pointed out in j40l 
that, in the nonconvex problem, the global (local) solver 
generates global (local) Pareto optimal, and local Pareto 
optimal is not of interest in front of global Pareto optimal. 

3) As illustrated in both Fig. |3(a)| and Fig. |3(b)[ an even 
distribution of weight a between [0, 1] do not lead to an 
even spread of the solutions on the approximated Pareto 
front. Moreover, the nonconvex part of the Pareto front 
cannot be attained using any weight. It is exactly the case 
in Fig. |3(b)[ in Fig. |3(a)[ a trivial nonconvex part between 
a = 0.3 and a = 0.5 on the approximated Pareto front 
is probably resulted from the nonoptimal solution of the 
suboptimization problem. These are two main drawbacks 
of the sum-weighted method. 

Nevertheless, the obtained approximation of Pareto front is 
of high value. On one hand, it provides a set of Pareto optimal 
solutions for the DM, instead of a single decomposition. On 
the other hand, an insight of the trade-off between objectives 
Jx and Ju reveals the underlying linearity/nonlinearity of the 
data under study, as illustrated in the following section. 


C. Performance 

In this section, we study the performance of the method on 
the unmixing problem in hyperspectral imagery. The unmixing 
performance is evaluated by two metrics introduced in (29). 
The first one is the reconstruction error in the input space (RE) 
defined by 


RE = 


N 


l 

TL 


t= 1 


N 

Y. a nt e t || 2 . 

71=1 


2 For example, the Pareto optimal solutions for the well-known Schaffer’s 
function, defined by J(x) = [x 2 , (x — 2) 2 ] T , are found only within the 
interval [0,2], where a tradeoff between two objectives exists. See [331 for 
more details. 
































The second one is the reconstruction error in the feature space 
(RE $ ), which is similarly defined as 


TABLE II: Unmixing performance for the Urban image 
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and «(•,•) denotes the Gaussian kernel. It is worth to note 
that RE $ can always be evaluated for any given matrices E 
and A, regardless of the optimization problem and the solving 
procedure that led to these matrices. 


State-of-the-art unmixing methods 

An unmixing problem comprises the estimation of endmem- 
bers and the corresponding abundance maps. Some existing 
techniques either extract the endmembers (such as VCA) or 
estimate the abundances (such as FCLsE other methods en¬ 
able the simultaneous estimations, e.g., NMF and its variants. 
We briefly present all the unmixing algorithms that are used 
in comparison. 

The most-known endmember extraction technique is the 
vertex component analysis (VCA) |[28l . It is based on the 
linear mixture model and presumes the existence of end- 
members within the image under analysis. It seeks to inflate 
the simplex enclosing all the spectra. The endmembers are 
the vertices of the largest simplex. This technique is applied 
for endmember extraction, jointly with three abundance es¬ 
timation techniques: FCLS, K-Hype and GBM-sNMF. The 
fully constrained least squares algorithm (FCLS) EQ is a 
least square approach using the linear mixture model, where 
the abundances are estimated considering the nonnegativity 
and sum-to-one constraints. A nonlinear unmixing model 
for abundance estimation is considered in EU, where the 
nonlinear term is described as a kernel-based model, with 
the so-called linear-mixture/nonlinear-fluctuation model (K- 
Hype). In ||52l . a generalized bilinear model is formulated, 
with parameters optimized using the semi-nonnegative matrix 
factorization (GBM-sNMF). 

We further consider five NMF-based techniques that are 
capable to estimate the endmembers and abundances jointly. 
The minimum dispersion constrained NMF (MinDisCo) m 
includes the dispersion regularization to the conventional 
NMF, by integrating the sum-to-one constraint for each pixel’s 
abundance fractions and the minimization of variance within 
each endmember. The problem is solved by exploiting an 
alternate projected gradient scheme. In the convex nonnega¬ 
tive matrix factorization (ConvexNMF) ED, the basic matrix 
(endmember matrix in our context) is restricted to the span 
of the input data, that is, each sample can be viewed as 

3 See (50] for connections between the endmember extraction techniques 
and the abundances estimation techniques. 





RE xio -2 

RE^xio- 2 


FCLS 

1.44 

3.89 


GBM-sNMF 

6.50 

4.11 


K-Hype 

5.99 

4.67 


MinDisCo 

3.12 

4.60 


ConvexNMF 

2.96 

5.84 


KconvexNMF 

- 

43.94 


KsNMF 

- 

4.33 


MercerNMF 

- 

2.96 


LinearNMF 

a = 1 

1.48 

3.96 

s 

§- 

GaussianNMF 

a = 0 

3.49 

1.39 

a. 


a = 0.18 

2.70 

1.27 

2 

Pareto Optimal 

a = 0.50 

2.38 

2.04 



a = 0.94 

1.40 

3.78 


TABLE III: Unmixing performance for the Cuprite image 
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FCLS 

0.95 

0.59 


GBM-sNMF 

1.06 

0.62 


K-Hype 

2.12 

0.93 


MinDisCo 

1.62 

4.54 


ConvexNMF 

1.61 

2.51 


KconvexNMF 

- 

19.53 


KsNMF 

- 

1.38 


MercerNMF 

- 

2.74 


LinearNMF 

a = 1 

0.89 

2.28 

s 

§• 

Oh 

GaussianNMF 

a = 0 

1.05 

0.50 


a = 0.04 

0.92 

0.42 

2 

ParetoOptimal 

a = 0.50 

0.84 

0.58 



a = 0.72 

0.77 

0.73 


a convex combination of certain data points. The kernel 
convex-NMF (KconvexNMF) and the kernel semi-NMF based 
on the nonnegative least squares (KsNMF) are essentially 
the kernelized variants of the ConvexNMF in ED and the 
alternating nonnegativity constrainted least squares and the 
active set method in | |53l , respectively, as discussed in |22|. 
Experiments are also conducted with these two kernel meth¬ 
ods, adopting the Gaussian kernel. Nonlinear NMF based on 
constructing Mercer kernels (MercerNMF), introduced in (54l . 
addresses the nonlinear NMF problem using a self-constructed 
Gaussian kernel, where the nonnegativity of the embedded 
bases and coefficients is preserved. The embedded data are 
finally factorized with conventional NMF. Of particular note 
is that only the reconstruction error in the feature space can be 
calculated for the aforementioned kernel-based methods, since 
the pre-images of the mapped endmember, which are required 
in the computing of reconstruction error in the input space, 
cannot be exploited. 

Unmixing performance 

The unmixing performance, with respect to the reconstruc¬ 
tion errors in the input and the feature spaces, is compared with 
the aforementioned unmixing approaches, as demonstrated in 
TABLE HI1 and TABLE Hill As can be observed, the proposed 
method with Pareto optimal solution outperforms not only the 
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(a) a = 1 








Fig. 4: Visualization of the trade-off between the two objec¬ 
tives Jx and . ly , and the change of the aggregated objective 
function J, along with the increment of weight a, for the 
Urban image and the Cuprite image. 


(c) a = 0 

Fig. 5: Urban image: Endmembers and corresponding abun¬ 
dance maps, estimated using a = 1 (conventional linear 
NMF); a = 0.3 (a Pareto optimal of the bi-objective NMF); 
a = 0 (nonlinear Gaussian NMF). 


existing linear NMF (a = 1) and Gaussian NMF (a = 0), but 
also all the state-of-the-art methods. 

The estimated endmembers and the corresponding abun¬ 
dance maps with the proposed method are shown in Fig. [5] 
and Fig. [6] For the Urban image, different areas (the road in 
particular) are better recognized with the Pareto optimal when 
compared with solutions of the linear and of the Gaussian 
NMF. Regarding the Cuprite image, the linear NMF (i.e., 
a = 1) recognizes two out of three regions; whereas both 
the Pareto optimal (i.e., a = 0.72) and the Gaussian NMF 
(i.e., a = 0) are able to distinguish three regions. However, 
the abundance maps of Gaussian NMF appear to be overly 
sparse, compared with its counterpart of the Pareto optimal 
solution. It is also noticed that the endmembers extracted with 
the linear NMF are spiky, and even with some zero-parts, thus 
meeting poorly the real situation. 


VI. Conclusion 

This paper presented a novel bi-objective nonnegative ma¬ 
trix factorization by exploiting the kernel machines, where the 
decomposition was performed simultaneously in the input and 
the feature space. The multiplicative update rules were derived. 
The performance of the method was demonstrated for un¬ 
mixing well-known hyperspectral images. The resulting Pareto 
fronts were analyzed. As for future work, we are extending this 
approach to include other NMF objective functions, defined 
in the input or the feature space. Considering simultaneously 
several kernels, and as a consequence several feature spaces, 
is also under investigation. 
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Fig. 6: Cuprite image: Endmembers and corresponding abun¬ 
dance maps, estimated using a = 1 (conventional linear 
NMF); a = 0.72 (a Pareto optimal of the bi-objective NMF); 
a = 0 (nonlinear Gaussian NMF). 
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